
[Kernels] Isolate modular kernel code from FusedMoEMethodBase subclasses.#27123

Merged

mgoin merged 16 commits into vllm-project:main from neuralmagic:swap-quant-method on Nov 4, 2025
Conversation

bnellnm (Collaborator) commented Oct 17, 2025

Purpose

Make a new FusedMoEModularMethod subclass of FusedMoEMethodBase for use with modular kernels.

Instead of having every subclass of FusedMoEMethodBase check self.fused_experts, we swap out the quant_method of the FusedMoE layer for an instance of FusedMoEModularMethod. This reduces the complexity of the various FusedMoEMethodBase subclasses' apply methods and isolates all uses of modular kernels to the new class.
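In code, the swap amounts to roughly the following minimal sketch (the constructor and apply() signatures here are simplified stand-ins, not the exact vLLM API):

```python
# Minimal sketch of the quant-method swap; signatures are simplified guesses.
class FusedMoEMethodBase:
    def apply(self, layer, x, router_logits):
        # Each quantization backend (fp8, AWQ, ...) overrides apply() and no
        # longer needs to branch on whether a modular kernel exists.
        raise NotImplementedError


class FusedMoEModularMethod(FusedMoEMethodBase):
    """Wraps the original quant method and routes apply() to a modular kernel."""

    def __init__(self, old_quant_method, fused_experts):
        self.old_quant_method = old_quant_method  # keeps weight/config access
        self.fused_experts = fused_experts        # the FusedMoEModularKernel

    def apply(self, layer, x, router_logits):
        # All modular-kernel dispatch now lives in this one class.
        return self.fused_experts(x, layer.w13_weight, layer.w2_weight, router_logits)
```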

Test Plan

- Ran by hand on some fp8 + modelopt models.
- CI tests

Test Result

cc @varun-sundar-rabindranath , @wenscarl

mergify bot (Contributor) commented Oct 17, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @bnellnm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Oct 17, 2025
chatgpt-codex-connector bot left a comment

💡 Codex Review

https://github.com/vllm-project/vllm/blob/474381baec872bc9f45e221754d420f41b93ace0/vllm/model_executor/layers/fused_moe/layer.py#L2115-L2118
P0: Accessing missing fused_experts attribute

The commit removes fused_experts from FusedMoEMethodBase, but FusedMoE still unconditionally accesses self.quant_method.fused_experts here (and again later when staging tokens). When the quant method does not use a modular kernel—e.g. AWQ, BitsAndBytes, RTN—init_prepare_finalize now leaves the original quant method in place and it no longer defines a fused_experts attribute. These checks will therefore raise AttributeError before any routing happens. The guard should use hasattr or using_modular_kernel instead of dereferencing the attribute directly.
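A defensive version of that guard could be as simple as this sketch (the helper name is hypothetical):

```python
def _uses_modular_kernel(quant_method) -> bool:
    # Hypothetical helper: non-modular quant methods (AWQ, BitsAndBytes, RTN)
    # don't define fused_experts, so probe with getattr instead of
    # dereferencing the attribute directly.
    return getattr(quant_method, "fused_experts", None) is not None
```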


Comment thread vllm/model_executor/layers/fused_moe/layer.py Outdated
gemini-code-assist bot left a comment

Code Review

This pull request refactors the handling of modular kernels for Fused MoE layers by introducing a FusedMoEModularMethod wrapper. This is a good simplification that centralizes logic. However, I've identified two critical issues that could lead to runtime errors. One is related to an incorrect condition for EPLB support in the FP8 quantization method, and the other is an incorrect API usage for submodule replacement. I have provided detailed comments and suggestions to address these issues.

Comment thread vllm/model_executor/layers/fused_moe/layer.py Outdated
Comment thread vllm/model_executor/layers/quantization/fp8.py
mergify bot removed the needs-rebase label Oct 18, 2025
bnellnm changed the title from "[Kernels] Swap quant method" to "[Kernels] Isolate modular kernel code from FusedMoEMethodBase subclasses." on Oct 20, 2025
mergify bot (Contributor) commented Oct 23, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @bnellnm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Comment thread vllm/model_executor/layers/fused_moe/layer.py
Comment thread vllm/model_executor/layers/fused_moe/layer.py Outdated
Comment thread vllm/model_executor/layers/fused_moe/layer.py
Comment thread vllm/model_executor/layers/fused_moe/modular_kernel.py Outdated
Comment thread vllm/model_executor/layers/fused_moe/layer.py Outdated
varun-sundar-rabindranath (Contributor) commented:

Thanks @bnellnm. This cleans up a bunch of redundant code 🙌.

I have a suggestion. IIUC, the function call chain for the construction of FusedMoEModularMethod looks something like the following:

1. `DeviceCommunicatorBase::prepare_communication_buffer_for_model()` calls  `FusedMoE::init_prepare_finalize()` 
2. `FusedMoE::init_prepare_finalize()` calls `FusedMoEMethodBase::init_prepare_finalize()` and it returns the `FusedMoEModularKernel` object
3. `FusedMoE::init_prepare_finalize()` then makes a `FusedMoEModularMethod` object and overrides its `self.quant_method`

Here, note that FusedMoEMethodBase::init_prepare_finalize() calls FusedMoEMethodBase::maybe_make_prepare_finalize(), which in turn calls a static function FusedMoEMethodBase::_maybe_make_prepare_finalize() that does most of the work anyway.

My suggestion is to move FusedMoEModularMethod into its own file and expose a function, say maybe_make_fused_moe_modular_method(), that attempts to construct the FusedMoEModularMethod object.

That way, we can get rid of most of the ModularKernel-specific code from fused_moe/layer.py and have it in a different file, cleaning up fused_moe/layer.py greatly.

What do you think?

I am not suggesting we do it in this PR. I can take it up as well 👍
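For reference, the proposed helper might look roughly like the sketch below (the function name comes from the comment above; the module, body, and signatures are guesses):

```python
# fused_moe/modular_method.py (hypothetical new home for FusedMoEModularMethod).
# Assumes FusedMoEModularKernel and FusedMoEModularMethod are importable here.

def maybe_make_fused_moe_modular_method(quant_method, moe_config):
    """Try to build a FusedMoEModularMethod; return the original quant
    method unchanged when no prepare/finalize object can be constructed."""
    prepare_finalize = quant_method.maybe_make_prepare_finalize(moe_config)
    if prepare_finalize is None:
        return quant_method
    experts = quant_method.select_gemm_impl(prepare_finalize, moe_config)
    kernel = FusedMoEModularKernel(prepare_finalize, experts)
    return FusedMoEModularMethod(quant_method, kernel)
```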

bnellnm (Collaborator, Author) commented Oct 24, 2025

> My suggestion is to move FusedMoEModularMethod into its own file and expose a function, say maybe_make_fused_moe_modular_method(), that attempts to construct the FusedMoEModularMethod object. That way, we can get rid of most of the ModularKernel-specific code from fused_moe/layer.py. […]

Yeah, that's a good idea. I was also considering splitting up layer.py in different ways, e.g. moving UnquantizedMoEMethod to a separate file.

I'd rather do that in a separate PR though.

Comment thread vllm/model_executor/layers/fused_moe/layer.py
Comment thread vllm/model_executor/layers/quantization/bitsandbytes.py
Review diff context:
if layer.w2_weight is None
else layer.w2_weight
)
assert all([w is not None for w in [layer.w13_weight, layer.w2_weight]])

I think this setting of layer.w13_weight and layer.w2_weight fits better in the process_weights_after_loading function here.

That way we can get rid of having to differentiate between w13_weight_triton_tensor/w2_weight_triton_tensor and w13_weight/w2_weight.

Not suggesting this for this PR. Fixing that, I think, should be its own PR.

varun-sundar-rabindranath (Contributor) left a comment

LGTM! Very nice cleanups! Thanks @bnellnm

tlrmchlsmth added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Oct 30, 2025
16 commits, each: Signed-off-by: Bill Nell <bnell@redhat.com>
mgoin merged commit 938772a into vllm-project:main on Nov 4, 2025
61 checks passed
wangshangsam (Collaborator) commented Nov 6, 2025

@bnellnm I have a ... maybe dumb ... question - how exactly is each derived MoEMethod class going to trigger FusedMoEModularMethod.apply() (thereby using the modular kernels)? Doesn't each subclass override the .apply() completely?

bnellnm (Collaborator, Author) commented Nov 6, 2025

> @bnellnm I have a ... maybe dumb ... question - how exactly is each derived MoEMethod class going to trigger FusedMoEModularMethod.apply() (thereby using the modular kernels)? Doesn't each subclass override the .apply() completely?

The FusedMoE layer calls self.quant_method.apply, so if no modular kernel has been constructed, this invokes the apply method of some subclass of FusedMoEMethodBase. When a modular kernel gets created, the FusedMoE layer swaps out self.quant_method for an instance of FusedMoEModularMethod, which calls the modular kernel instead.

So, subclasses of FusedMoEMethodBase no longer need to worry about modifying apply for modular kernels.
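A toy, runnable illustration of that dispatch (class names are generic stand-ins, not the actual vLLM classes):

```python
class BackendMethod:
    # Stand-in for a FusedMoEMethodBase subclass.
    def apply(self, x):
        return f"backend apply({x})"


class ModularMethod(BackendMethod):
    # Stand-in for FusedMoEModularMethod wrapping the old method.
    def __init__(self, old_method):
        self.old_method = old_method

    def apply(self, x):
        return f"modular-kernel apply({x})"


class Layer:
    # Stand-in for FusedMoE: always dispatches through self.quant_method.
    def __init__(self):
        self.quant_method = BackendMethod()

    def forward(self, x):
        return self.quant_method.apply(x)


layer = Layer()
print(layer.forward("t0"))  # -> backend apply(t0)

# After a modular kernel is constructed, the layer swaps the instance:
layer.quant_method = ModularMethod(layer.quant_method)
print(layer.forward("t1"))  # -> modular-kernel apply(t1)
```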

ZhengHongming888 pushed a commit to ZhengHongming888/vllm that referenced this pull request Nov 8, 2025
bnellnm deleted the swap-quant-method branch November 11, 2025 19:43
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025